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Abstract 

We propose a novel deep neural network architecture for semi-supervised se¬ 
mantic segmentation using heterogeneous annotations. Contrary to existing ap¬ 
proaches posing semantic segmentation as a single task of region-based classifi¬ 
cation, our algorithm decouples classification and segmentation, and learns a sep¬ 
arate network for each task. In this architecture, labels associated with an image 
are identified by classification network, and binary segmentation is subsequently 
performed for each identified label in segmentation network. The decoupled ar¬ 
chitecture enables us to learn classification and segmentation networks separately 
based on the training data with image-level and pixel-wise class labels, respec¬ 
tively. It facilitates to reduce search space for segmentation effectively by exploit¬ 
ing class-specific activation maps obtained from bridging layers. Our algorithm 
shows outstanding performance compared to other semi-supervised approaches 
even with much less training images with strong annotations in PASCAL VOC 
dataset. 


1 Introduction 

Semantic segmentation is a technique to assign structured semantic labels—typically, object class 
labels—to individual pixels in images. This problem has been studied extensively over decades, 
yet remains challenging since object appearances involve significant variations that are potentially 
originated from pose variations, scale changes, occlusion, background clutter, etc. However, in spite 
of such challenges, the techniques based on Deep Neural Network (DNN) demonstrate impressive 
performance in the standard benchmark datasets such as PASCAL VOC (T). 

Most DNN-based approaches pose semantic segmentation as pixel-wise classification problem [2] 
El SI 0 0. Although these approaches have achieved good performance compared to previous 
methods, training DNN requires a large number of segmentation ground-truths, which result from 
tremendous annotation efforts and costs. For this reason, reliable pixel-wise segmentation annota¬ 
tions are typically available only for a small number of classes and images, which makes supervised 
DNNs difficult to be applied to semantic segmentation tasks involving various kinds of objects. 

Semi- or weakly-supervised learning approaches OIL 9]| 10] alleviate the problem in lack of training 
data by exploiting weak label annotations per bounding box 110. 8| or image mmm. They often 
assume that a large set of weak annotations is available during training while training examples with 
strong annotations are missing or limited. This is a reasonable assumption because weak annotations 
such as class labels for bounding boxes and images require only a fraction of efforts compared to 
strong annotations, i.e., pixel-wise segmentations. The standard approach in this setting is to update 
the model of a supervised DNN by iteratively inferring and refining hypothetical segmentation la¬ 
bels using weakly annotated images. Such iterative techniques often work well in practice mm, 
but training methods rely on ad-hoc procedures and there is no guarantee of convergence; imple¬ 
mentation may be tricky and the algorithm may not be straightforward to reproduce. 
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We propose a novel decoupled architecture of DNN appropriate for semi-supervised semantic seg¬ 
mentation, which exploits heterogeneous annotations with a small number of strong annotations— 
full segmentation masks—as well as a large number of weak annotations—object class labels per 
image. Our algorithm stands out from the traditional DNN-based techniques because the architec¬ 
ture is composed of two separate networks; one is for classification and the other is for segmentation. 
In the proposed network, object labels associated with an input image are identified by classification 
network while figure-ground segmentation of each identified label is subsequently obtained by seg¬ 
mentation network. Additionally, there are bridging layers, which deliver class-specific information 
from classification to segmentation network and enable segmentation network to focus on the single 
label identified by classification network at a time. 

Training is performed on each network separately, where networks for classification and segmenta¬ 
tion are trained with image-level and pixel-wise annotations, respectively; training does not require 
iterative procedure, and algorithm is easy to reproduce. More importantly, decoupling classification 
and segmentation reduces search space for segmentation significantly, which makes it feasible to 
train the segmentation network with a handful number of segmentation annotations. Inference in 
our network is also simple and does not involve any post-processing. Extensive experiments show 
that our network substantially outperforms existing semi-supervised techniques based on DNNs 
even with much smaller segmentation annotations, e.g., 5 or 10 strong annotations per class. 

The rest of the paper is organized as follows. We briefly review related work and introduce overall 
algorithm in Section[2] and [3] respectively. The detailed configuration of the proposed network is 
described in Section 4j and training algorithm is presented in Section [5] Section [6] presents experi¬ 
mental results on a challenging benchmark dataset. 


2 Related Work 

Recent breakthrough in semantic segmentation are mainly driven by supervised approaches rely¬ 
ing on Convolutional Neural Network (CNN) 12 Ell 130. Based on CNNs developed for image 
classification, they train networks to assign semantic labels to local regions within images such 
as pixels (21313 or superpixels mM. Notably, Long et al. m propose an end-to-end system 
for semantic segmentation by transforming a standard CNN for classification into a fully convo¬ 
lutional network. Later approaches improve segmentation accuracy through post-processing based 
on fully-connected CRL [3 Till . Another branch of semantic segmentation is to learn a multi-layer 
deconvolution network, which also provides a complete end-to-end pipeline (121 . However, training 
these networks requires a large number of segmentation ground-truths, but the collection of such 
dataset is a difficult task due to excessive annotation efforts. 

To mitigate heavy requirement of training data, weakly-supervised learning approaches start to draw 
attention recently. In weakly-supervised setting, the models for semantic segmentation have been 
trained with only image-level labels GUIDES or bounding box class labels ED. Given weakly 
annotated training images, they infer latent segmentation masks based on Multiple Instance Learning 
(MIL) (301 or Expectation-Maximization (EM) 0 framework based on the CNNs for supervised 
semantic segmentation. However, performance of weakly supervised learning approaches except 
Col is substantially lower than supervised methods, mainly because there is no direct supervision for 
segmentation during training. Note that ED requires bounding box annotations as weak supervision, 
which are already pretty strong and significantly more expensive to acquire than image-level labels. 

Semi-supervised learning is an alternative to bridge the gap between fully- and weakly-supervised 
learning approaches. In the standard semi-supervised learning framework, given only a small num¬ 
ber of training images with strong annotations, one needs to infer the full segmentation labels for the 
rest of the data. However, it is not plausible to learn a huge number of parameters in deep networks 
reliably in this scenario. Instead, [8 , 10] train the models based on heterogeneous annotations—a 
large number of weak annotations as well as a small number strong annotations. This approach is 
motivated from the facts that the weak annotations, i.e., object labels per bounding box or image, is 
much more easily accessible than the strong ones and that the availability of the weak annotations 
is useful to learn a deep network by mining additional training examples with full segmentation 
masks. Based on supervised CNN architectures, they iteratively infer and refine pixel-wise seg¬ 
mentation labels of weakly annotated images with guidance of strongly annotated images, where 
image-level labels co and bounding box annotations Col are employed as weak annotations. They 
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Figure 1: The architecture of the proposed network. While classification and segmentation networks 
are decoupled, bridging layers deliver critical information from classification network to segmenta¬ 
tion network. 


claim that exploiting few strong annotations substantially improves the accuracy of semantic seg¬ 
mentation while it reduces annotations efforts for supervision significantly. However, they rely on 
iterative training procedures, which are often ad-hoc and heuristic and increase complexity to repro¬ 
duce results in general. Also, these approaches still need a fairly large number of strong annotations 
to achieve reliable performance. 

3 Algorithm Overview 

Figure [T] presents the overall architecture of the proposed network. Our network is composed of 
three parts: classification network, segmentation network and bridging layers connecting the two 
networks. In this model, semantic segmentation is performed by separate but successive operations 
of classification and segmentation. Given an input image, classification network identifies labels 
associated with the image, and segmentation network produces pixel-wise figure-ground segmen¬ 
tation corresponding to each identified label. This formulation may suffer from loose connection 
between classification and segmentation, but we figure out this challenge by adding bridging layers 
between the two networks and delivering class-specific information from classification network to 
segmentation network. Then, it is possible to optimize the two networks using separate objective 
functions while the two decoupled tasks collaborate effectively to accomplish the final goal. 

Training our network is very straightforward. We assume that a large number of image-level annota¬ 
tions are available while there are only a few training images with segmentation annotations. Given 
these heterogeneous and unbalanced training data, we first learn the classification network using 
rich image-level annotations. Then, with the classification network fixed, we jointly optimize the 
bridging layers and the segmentation network using a small number of training examples with strong 
annotations. There are only a small number of strongly annotated training data, but we alleviate this 
challenging situation by generating many artificial training examples through data augmentation. 

The contributions and characteristics of the proposed algorithm are summarized below: 

• We propose a novel DNN architecture for semi-supervised semantic segmentation using het¬ 
erogeneous annotations. The new architecture decouples classification and segmentation tasks, 
which enables us to employ pre-trained models for classification network and train only seg¬ 
mentation network and bridging layers using a few strongly annotated data. 

• The bridging layers construct class-specific activation maps, which are delivered from classifi¬ 
cation network to segmentation network. These maps provide strong priors for segmentation, 
and reduce search space dramatically for training and inference. 

• Overall training procedure is very simple since two networks are to be trained separately. Our 
algorithm does not infer segmentation labels of weakly annotated images through iterative 
heuristic^] which are common in semi-supervised learning techniques (8 , 10J. 


'Due to this property, our framework is different from standard semi-supervised learning but close to few- 
shot learning with heterogeneous annotations. Nonetheless, we refer to it as semi-supervised learning in this 
paper since we have a fraction of strongly annotated data in our training dataset but complete annotations of 
weak labels. Note that our level of supervision is similar to the semi-supervised learning case in [81. 
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The proposed algorithm provides a concept to make up for the lack of strongly annotated training 
data using a large number of weakly annotated data. This concept is interesting because the assump¬ 
tion about the availability of training data is desirable for real situations. We estimate figure-ground 
segmentation maps only for the relevant classes identified by classification network, which improves 
scalability of algorithm in terms of the number of classes. Finally, our algorithm outperforms the 
comparable semi-supervised learning method with substantial margins in various settings. 

4 Architecture 

This section describes the detailed configurations of the proposed network, including classification 
network, segmentation network and bridging layers between the two networks. 

4.1 Classification Network 

The classification network takes an image x as its input, and outputs a normalized score vector 
S(x; 6 C ) G R l representing a set of relevance scores of the input x based on the trained classification 
model 6 C for predefined L categories. The objective of classification network is to minimize error 
between ground-truths and estimated class labels, and is formally written as 

min c (y i ,5(x i ;6> c )), (1) 

C'c 

l 

where G {0, 1} L denotes the ground-truth label vector of the i- th example and e c (y^, S'(x i ; 6 C )) 
is classification loss of 5'(x^; 6 C ) with respect to y^. 

We employ VGG 16-layer net ca as the base architecture for our classification network. It con¬ 
sists of 13 convolutional layers, followed by rectification and optional pooling layers, and 3 fully 
connected layers for domain-specific projection. Sigmoid cross-entropy loss function is employed 
in Eq. 0, which is a typical choice in multi-class classification tasks. 

Given output scores S^x^; 6 C ), our classification network identifies a set of labels Ci associated with 
input image x^. The region in x^ corresponding to each label / G Ci is predicted by the segmentation 
network discussed next. 

4.2 Segmentation Network 

The segmentation network takes a class-specific activation map g- of input image x^, which is ob¬ 
tained from bridging layers, and produces a two-channel class-specific segmentation map M (g-; 6 S ) 
after applying softmax function, where 6 S is the model parameter of segmentation network. Note 
that M(g-;# s ) has foreground and background channels, which are denoted by M/(g-;# s ) and 
Mfr(g-; 0 S ), respectively. The segmentation task is formulated as per-pixel regression to ground- 
truth segmentation, which minimizes 

m in £e s (z',M(g*;6> s )), (2) 

& s 

where z \ denotes the binary ground-truth segmentation mask for category l of the i -th image x z and 
e s (zi, M( g-; 6 S )) is the segmentation loss of Mf( g-; 6 S )—or equivalently the segmentation loss of 
Mb( 6 S ) —with respect to z\. 

The recently proposed deconvolution network [12 ] is adopted for our segmentation network. Given 
an input activation map g- corresponding to input image x ?: , the segmentation network generates 
a segmentation mask in the same size to x 2 by multiple series of operations of unpooling, decon¬ 
volution and rectification. Unpooling is implemented by importing the switch variable from every 
pooling layer in the classification network, and the number of deconvolutional and unpooling layers 
are identical to the number of convolutional and pooling layers in the classification network. We 
employ the softmax loss function to measure per-pixel loss in Eq. ([2]). 

Note that the objective function in Eq. ([2} corresponds to pixel-wise binary classification; it infers 
whether each pixel belongs to the given class l or not. This is the major difference from the existing 
networks for semantic segmentation including m, which aim to classify each pixel to one of the 
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L predefined classes. By decoupling classification from segmentation and posing the objective 
of segmentation network as binary classification, our algorithm reduces the number of parameters 
in the segmentation network significantly. Specifically, this is because we identify the relevant 
labels using classification network and perform binary segmentation for each of the labels, where the 
number of output channels in segmentation network is set to two—for foreground and background— 
regardless of the total number of candidate classes. This property is especially advantageous in our 
challenging scenario, where only a few pixel-wise annotations (typically 5 to 10 annotations per 
class) are available for training segmentation network. 


4.3 Bridging Layers 


To enable the segmentation network described in Section 42 to produce the segmentation mask of 
a specific class, the input to the segmentation network should involve class-specific information as 
well as spatial information required for shape generation. To this end, we have additional layers 
underneath segmentation network, which is referred to as bridging layers, to construct the class- 
specific activation map g \ for each identified label l G 


To encode spatial configuration of objects presented in image, we exploit outputs from an intermedi¬ 
ate layer in the classification network. We take the outputs from the last pooling layer (pool5) since 
the activation patterns of convolution and pooling layers often preserve spatial information effec¬ 
tively while the activations in the higher layers tend to capture more abstract and global information. 
We denote the activation map of pool5 layer by f spat afterwards. 

Although activations in f spat maintain useful information for shape generation, they contain mixed 
information of all relevant labels in x* and we should identify class-specific activations in f spat ad¬ 
ditionally. For the purpose, we compute class-specific saliency maps using the back-propagation 
technique proposed in fbfll . Let fM be the output of the i -th layer (i = 1,..., M) in the classifica¬ 
tion network. The relevance of activations in f 1 ^ with respect to a specific class l is computed by 
chain rule of partial derivative, which is similar to error back-propagation in optimization, as 


= 9S l 

cls <9f(*0 


<9f( M ) Qf( M - 1 ) 
Qf(M-l) Qf(M-2) 


<9fO+l) 
<9f(*0 ’ 


( 3 ) 


where f c z ls denotes class-specific saliency map and Si is the classification score of class /. Intuitively, 
Eq. ([3]) means that the values in f c z ls depend on how much the activations in f W are relevant to class 
l; this is measured by computing the partial derivative of class score Si with respect to the activations 
in f( fe ). We back-propagate the class-specific information until pool5 layer. 

The class-specific activation map g- is obtained by combining both f spat and f c z ls . We first concatenate 
f spat and f c z ls in their channel direction, and forward-propagate it through the fully-connected bridging 
layers, which discover the optimal combination of f spat and f c z ls using the trained weights. The 
resultant class-specific activation map g \ that contains both spatial and class-specific information is 
given to segmentation network to produce a class-specific segmentation map. Note that the changes 
in g \ depend only on f c \ s since f spat is fixed for all classes in an input image. 

Figure[2]visualizes the examples of class-specific activation maps g \ obtained from several validation 
images. The activations from the images in the same class share similar patterns despite substantial 
appearance variations, which shows that the outputs of bridging layers capture class-specific infor¬ 
mation effectively; this property makes it possible to obtain figure-ground segmentation maps for 
individual relevant classes in segmentation network. More importantly, it reduces the variations of 
input distributions for segmentation network, which allows to achieve good generalization perfor¬ 
mance in segmentation even with a small number of training examples. 

For inference, we compute a class-specific activation map g \ for each identified label l G Ci and ob¬ 
tain class-specific segmentation maps {M(g-; 0 s )}vze/V I n addition, we obtain M(g*; 0 S ), where 
g* is the activation map from the bridging layers for all identified labels. The final label estimation 
is given by identifying the label with the maximum score in each pixel out of {M/(g-; 0 s )}\/i e a 
and Mfr(g*; 6 S ). Figure [^illustrates the output segmentation map of each g- for x$, where each map 
identifies high response area given g \ successfully. 
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Figure 2: Examples of class-specific activation maps (output of bridging layers). Despite significant 
variations in input images, the class-specific activation maps share similar properties, which suggests 
that the search space for segmentation may not be as huge as our prejudice. 
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Figure 3: Input image (left) and its segmentation maps (right) of individual classes. 


5 Training 

In our semi-supervised learning scenario, we have mixed training examples with weak and strong 
annotations. Let W = {1,..., N w } and S = {1,..., N s } denote the index sets of images with image- 
level and pixel-wise class labels, respectively, where N w N s . We first train the classification 
network using the images in W by optimizing the loss function in Eq. 0. Then, fixing the weights 
in the classification network, we jointly train the bridging layers and the segmentation network using 
images in S by optimizing Eq. 0. For training segmentation network, we need to obtain class- 
specific activation map g- from bridging layers using ground-truth class labels associated with x^, 
i G S. Note that we can reduce complexity in training by optimizing the two networks separately. 

Although the proposed algorithm has several advantages in training segmentation network with few 
training images, it would still be better to have more training examples with strong annotations. 
Hence, we propose an effective data augmentation strategy, combinatorial cropping. Let £* denotes 
a set of ground-truth labels associated with image x^, i G S. We enumerate all possible combina¬ 
tions of labels in P(£*), where P(£*) denotes the powerset of £*. For each V G P(£*) except 
empty set (V ^ 0), we construct a binary ground-truth segmentation mask zf by setting the pixels 
corresponding to every label 1 G Pas foreground and the rests as background. Then, we generate N p 
sub-images enclosing the foreground areas based on region proposal method m and random sam¬ 
pling. Through this simple data augmentation technique, we have N t = N s + N p • — l) 

training examples with strong annotations effectively, where N t N s . 

6 Experiments 

6.1 Implementation Details 

Dataset We employ PASCAL VOC 2012 dataset (2 for training and testing of the proposed deep 
network. The dataset with extended annotations from EH, which contains 12,031 images with 
pixel-wise class labels, is used in our experiment. To simulate semi-supervised learning scenario, we 
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Table 1: Evaluation results on PASCAL VOC 2012 validation set. 


# of strongs 

DecoupledNet 

WSSL-Small FoV O 

WSSL-Large-FoV OD 

DecoupledNet-Str 

DeconvNet [IT] 

Full 

67.5 

63.9 

67.6 

67.5 

67.1 

25 (x20 classes) 

62.1 

56.9 

54.2 

50.3 

38.6 

10 (x20 classes) 

57.4 

47.6 

38.9 

41.7 

21.5 

5 (x 20 classes) 

53.1 

- 

- 

32.7 

15.3 


Table 2: Evaluation results on PASCAL VOC 2012 test set. 


Models 

bkg areo bike bird boat bottle bus car cat chair cow table dog horse mbk prsn pint sheep sofa train tv 

mean 

DecoupledNet-Full 

91.5 78.8 39.9 78.1 53.8 68.3 83.2 78.2 80.6 25.8 62.6 55.5 75.1 77.2 77.1 76.0 47.8 74.1 47.5 66.4 60.4 

66.6 

DecoupledNet-25 
DecoupledNet-10 
DecoupledNet-5 

90.1 75.8 41.7 70.4 46.4 66.2 83.0 69.9 76.7 23.1 61.2 43.3 70.4 75.7 74.1 65.7 46.2 73.8 39.7 61.9 57.6 

88.5 73.8 40.1 68.1 45.5 59.5 76.4 62.7 71.4 17.7 60.4 39.9 64.5 73.0 68.5 56.0 43.4 70.8 37.8 60.3 54.2 

87.4 70.4 40.9 60.4 36.3 61.2 67.3 67.7 64.6 12.8 60.2 26.4 63.2 69.6 64.8 53.1 34.7 65.3 34.4 57.0 50.5 

62.5 

58.7 

54.7 


divide the training images into two non-disjoint subsets —W with weak annotations only and S with 
strong annotations as well. There are 10,582 images with image-level class labels, which are used to 
train our classification network. We also construct training datasets with strongly annotated images; 
the number of images with segmentation labels per class is controlled to evaluate the impact of 
supervision level. In our experiment, three different cases—5, 10, or 25 training images with strong 
annotations per class—are tested to show the effectiveness of our semi-supervised framework. We 
evaluate the performance of the proposed algorithm on 1,449 validation images. 

Data Augmentation We employ different strategies to augment training examples in the two 
datasets with weak and strong annotations. For the images with weak annotations, simple data aug¬ 
mentation techniques such as random cropping and horizontal flipping are employed as suggested in 
(ED- We perform combinatorial cropping proposed in Section [5] for the images with strong annota¬ 
tions, where EdgeBox m is adopted to generate region proposals and the N p (= 200) sub-images 
are generated for each label combination. 

Optimization We implement the proposed network based on Caffe library [17]. The standard 
Stochastic Gradient Descent (SGD) with momentum is employed for optimization, where all pa¬ 
rameters are identical to m . We initialize the weights of the classification network using VGG 
16-layer net pre-trained on ILSVRC CD dataset. When we train the deep network with full annota¬ 
tions, the network converges after approximately 5.5K and 17.5K SGD iterations with mini-batches 
of 64 examples in training classification and segmentation networks, respectively; training takes 3 
days (0.5 day for classification network and 2.5 days for segmentation network) in a single Nvidia 
GTX Titan X GPU with 12G memory. Note that training segmentation network is much faster in 
our semi-supervised setting while there is no change in training time of classification network. 

6.2 Results on PASCAL VOC Dataset 

Our algorithm denoted by DecoupledNet is compared with two variations in WSSL 0, which is an¬ 
other algorithm based on semi-supervised learning with heterogeneous annotations. We also test the 
performance of DecoupledNet-Sti[^] and DeconvNet CE2, which only utilize examples with strong 
annotations, to analyze the benefit of image-level weak annotations. All learned models in our ex¬ 
periment are based only on the training set (not including the validation set) in PASCAL VOC 2012 
dataset. All algorithms except WSSL 0 report the results without CRF. Segmentation accuracy is 
measured by Intersection over Union (IoU) between ground-truth and predicted segmentation, and 
the mean IoU over 20 semantic categories is employed for the final performance evaluation. 

Table [I] summarizes quantitative results on PASCAL VOC 2012 validation set. Given the same 
amount of supervision, DecoupledNet presents substantially better performance even without any 
post-processing than WSSL [ 81], which is a directly comparable method. In particular, our algo¬ 
rithm has great advantage over WSSL when the number of strong annotations is extremely small. 
We believe that this is because DecoupledNet reduces search space for segmentation effectively by 
employing the bridging layers and the deep network can be trained with a smaller number of images 

2 This is identical to DecoupledNet except that its classification and segmentation networks are trained with 
the same images, where image-level weak annotations are generated from pixel-wise segmentation annotations. 
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Figure 4: Semantic segmentation results of several PASCAL VOC 2012 validation images based 
on the models trained on a different number of pixel-wise segmentation annotations. 


with strong annotations consequently. Our results are even more meaningful since training proce¬ 
dure of DecoupledNet is very straightforward compared to WSSL and does not involve heuristic 
iterative procedures, which are common in semi-supervised learning methods. 

When there are only a small number of strongly annotated training data, our algorithm obviously 
outperforms DecoupledNet-Str and DeconvNet lfl2l by exploiting the rich information of weakly an¬ 
notated images. It is interesting that DecoupledNet-Str is clearly better than DeconvNet, especially 
when the number of training examples is small. For reference, the best accuracy of the algorithm 
based only on the examples with image-level labels is 42.0% Q, which is much lower than our 
result with five strongly annotated images per class, even though (3 requires significant efforts for 
heuristic post-processing. These results show that even little strong supervision can improve seman¬ 
tic segmentation performance dramatically. 

Table [2] presents more comprehensive results of our algorithm in PASCAL VOC test set. Our algo¬ 
rithm works well in general and approaches to the empirical upper-bound fast with a small number 
of strongly annotated images. A drawback of our algorithm is that it does not achieve the state-of- 
the-art performance 0 [lTJ [12 1 when the (almos0 full supervision is provided in PASCAL VOC 
dataset. This is probably because our method optimizes classification and segmentation networks 
separately although joint optimization of two objectives is more desirable. However, note that our 
strategy is more appropriate for semi-supervised learning scenario as shown in our experiment. 

Figure [4] presents several qualitative results from the proposed algorithm. Note that our model 
trained only with five strong annotations per class already shows good generalization performance, 
and that more training examples with strong annotations improve segmentation accuracy and reduce 
label confusions substantially. Refer to our project websit^] for more comprehensive qualitative 
evaluation. 

7 Conclusion 

We proposed a novel deep neural network architecture for semi-supervised semantic segmentation 
with heterogeneous annotations, where classification and segmentation networks are decoupled for 
both training and inference. The decoupled network is conceptually appropriate for exploiting het¬ 
erogeneous and unbalanced training data with image-level class labels and/or pixel-wise segmen¬ 
tation annotations, and simplifies training procedure dramatically by discarding complex iterative 
procedures for intermediate label inferences. Bridging layers play a critical role to reduce output 

3 We did not include the validation set for training and have less training examples than the competitors. 

4 http://cvlab.postech.ac.kr/research/decouplednet/ 
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space of segmentation, and facilitate to learn segmentation network using a handful number of seg¬ 
mentation annotations. Experimental results validate the effectiveness of our decoupled network, 
which outperforms existing semi- and weakly-supervised approaches with substantial margins. 
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